ABSTRACT 


The communication and synchronization overhead inherent in parallel processing can lead to 
situations where adding processors to the solution method actually increases execution time. 
Problem type, problem size, and architecture type all affect the optimal number of processors to 
employ. In this paper we examine the numerical solution of an elliptic partial differential equation 
in order to study the relationship between problem size and architecture. The equation’s 
domain is discretized into $n^2$ grid points which are divided into partitions and mapped onto the 
individual processor memories. We analytically quantify the relationships between grid size, 
stencil type, partitioning strategy, processor execution time, and communication network type. 
In doing so, we determine the optimal number of processors to assign to the solution (and hence 
the optimal speedup), and identify (1) the smallest grid size which fully benefits from using all 
available processors, (2) the leverage on performance given by increasing processor speed or com- 
munication network speed, (3) the suitability of various architectures for large numerical prob- 
lems. 


This research was supported by the National Aeronautics and Space Administration under NASA 
Contract Number NAS1-18107 while the author was in residence at ICASE, NASA Langley Research 
Center, Hampton, VA 23665. 

1. Introduction 

A numerical solution to an elliptic partial differential equation (PDE) is usually constructed by 
modeling the continuous domain of the equation’s variables with a grid of discrete points. The partial 
derivatives are approximated using some differencing scheme, and a linear set of equations is con- 
structed whose unknowns are the values of the solution function at each of the grid points. During an 
iterative solution of these equations (e.g. point Jacobi) the value at a point is approximated by a func- 
tion of values at nearby points. The amount of computational work associated with updating an interior 
grid point is the same throughout the grid. Furthermore, during a single iteration grid points can be 
updated in parallel. This high degree of regularity and potential parallelism has made the solution of 
PDEs a very attractive problem area for the application of parallel processing. 

An elliptic PDE problem may be solved in parallel by decomposing the grid into partitions, and 
mapping partitions to processors. During an iteration a processor updates its grid points, and then 
exchanges with other processors information necessary to compute the next iteration. As pointed out in 
[12], a large number of factors affect the performance of the resulting parallel computation: discretiza- 
tion stencil, partition shape, and parallel architecture. The analysis in [12] quantifies these relationships 
for a wide variety of stencils, shapes, and architectures. Their work throughout assumes that all pro- 
cessors in a parallel system are employed. This paper uses their framework to determine the largest 
possible speedup for a given problem, and to consider the behavior of that optimal speedup as a func- 
tion of problem size when the number of available processors is not limited. These issues are important 
when we consider that users of large scientific codes will always want to solve a larger problem than 
the current technology supports. By focusing on the best possible speedup we are better able to assess 
the suitability of various architectures for scaling up to larger problems, and the effects that various 
problem parameters and architecture parameters have on that suitability. 

We will consider both strip and square partitions; although it is well known that squares have a 
higher computation to communication ratio, situations exist where the use of strips yields better perfor- 
mance than squares [13]. Other authors have employed strips [7] when the number of available proces- 
sors is not a power of 4 (to avoid this last problem, we show that "nearly square" partitions perform 
within a few percentage points of true squares). 

It is a folk theorem among the parallel scientific processing community that good speedup can be 
achieved simply by increasing the size of the problem. In fact, our analysis shows that this is indeed 
true for several different types of architectures, provided that the maximal number of processors is 
fixed. However, by allowing the number of processors (and supporting communication network) to 
grow along with the problem size, it becomes clear that some architectures are better suited for large 
problems than others. Architectures with hypercube or grid communication networks are shown to 
give linear optimal speedup in the grid size $n^2$, while bus-oriented networks are shown to give optimal 
speedup which increases at best in the cube root of $n^2$. The effect of the relationship between fixed 
communication overhead costs and bus bandwidth is shown to be important. We show that banyan 
type switching networks give optimal speedup which is $O(n^2/\log n)$. From these results it is clear that 
bus networks are unsuited for large numerical problems of the type we consider. While hypercubes 
give better asymptotic optimal speedup than banyan networks, the true difference for grid sizes used in 
practice will not depend on the banyan network’s log factor, but on the relative speeds of the commun- 
ication networks. 

2. Previous Work 

Partition geometry plays a key role in determining communication costs; consequently, much of 
the literature related to domain decomposition concerns the partition’s geometric shape. Strips, 
squares, triangles, and hexagons have been considered in [4,12,16] on both message-passing and shared 
memory architectures. Reed, Adams and Patrick [12] have done a careful analysis of the relationships 
between discretization stencils, partition shape, parallel architecture, and data structure management. 
Their model determines which stencil/partition/architecture trios are best suited for each other. We will 
introduce their model in the main portion of the paper. Neither the analysis in [12] nor other work 
concerning partition shapes has explicitly focused on optimizing the number of processors used, or on the 
behavior of optimal speedup as the problem size increases. 

An analytic study of a conjugate gradient algorithm on the Finite Element Machine (FEM) is 
found in [1]. Their approach to modeling the computation is similar to ours, but is focused entirely on 
the FEM. The difference between the algorithm they study and the class of algorithms we study led to 
different conclusions concerning asymptotic performance. 

Other related work uses a more abstract model of a parallel computation. In [6], Indurkya, Stone, 
and Cheng consider the module assignment problem under the assumptions of random module execu- 
tion times and random communication patterns. They explicitly set out to determine the optimal 
number of processors to use. Convenient approximations were made to make the overall execution time 
more tractable; some of these approximations were removed by Nicol in [9], where it is shown that 
Indurkya’s conclusions are basically sound despite the approximations (all of Indurkya’s conclusions 
hold rigorously if module execution times are constant). The cost function studied in that work was 
the sum of the execution time and the expected communication overhead. Their somewhat surprising 
conclusion is that the optimal assignment of modules to processors is extremal: either all modules are 
assigned to one processor, or the modules are distributed as evenly as possible across all available 
processors. 

The cost model studied by Indurkya et al. and Nicol fails to capture the potential overlap of 
communication and computation in some architectures. Stone [15] also realized this, and gives a thorough 
analysis of a number of simple cost and communication models for the module assignment problem. 
Several of these models allow situations where adding processors increases execution time, so that the 
optimal assignment need not be extremal. For computations captured by these models, finding the 
optimal number of processors becomes an important issue. Stone uses a parallel solution of the Poisson 
equation to illustrate the relationship of these models to a real problem. His discussion does not treat 
the relative merits of partition geometries and stencils, although he does consider partitioning domain 
rows into pieces. A similar abstract view of this problem is given by Cvetanovic [3]. In contrast, our 
goal in this paper is to show how to optimize the size of a given partition shape for a given PDE on a 
given architecture. We then use the optimal size to characterize the suitability of the architecture for 
large numerical problems. 

3. Model Description 

A square physical domain is discretized into an $n \times n$ grid of points, and constant boundary values 
are assumed. Depending on the algorithm used, the value at a grid point $U_{i,j}$ is updated according to a 
discretization stencil. For example, figure 1 shows a 5-point stencil and a higher order 9-point stencil 
for the Laplace equation, solved using point Jacobi iteration. The equations clearly show that the 
stencil has a direct impact on the amount of computation performed. A grid partitioned into squares is 
shown in figure 2. From the equations in figure 1, we see that a grid point on the partition boundary of 
one square needs the values of one or more grid points in adjacent squares. Consequently the chosen 
stencil also affects the amount of communication. Since every boundary point must be communicated, 
the perimeter of a partition’s shape affects communication volume. For example, a rectangular strip 
with $rn$ points has $2(r + n)$ boundary points, while a square partition with $rn$ points has $4\sqrt{rn}$ points; 
$2(r + n) > 4\sqrt{rn}$ whenever $r \ne n$. Furthermore, some stencils require the communication of more than 
just one perimeter boundary; for example, see figure 3. Partitions are categorized in [12] with respect to 
a given stencil by the number of "perimeters" that must be communicated when the stencil is used. 

Figure 1 5-point and 9-point stencils with update equations 

Figure 2 Square partitions on grid 

Figure 3 Stencils requiring more than one perimeter communicated (9-cross stencil, 13-point stencil) 

Following this idea, we define $k(P, S)$ to be the number of perimeters communicated by partition $P$ using 
stencil $S$. Some values of $k(P, S)$ are given below. 


Partition    Stencil    k(Partition, Stencil) 
Strip        5-point    1 
Square       5-point    1 
Strip        9-cross    2 
Square       9-cross    2 
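To make the role of the stencil concrete, the following is a minimal sketch of one point Jacobi sweep with a 5-point stencil for the Laplace equation. The update equations of figure 1 are not reproduced in this text, so the standard four-neighbor averaging form is assumed; the function name `jacobi_sweep_5pt` and the list-of-lists grid layout are illustrative choices, not the authors' code.

```python
def jacobi_sweep_5pt(u):
    """Return a new grid after one point Jacobi sweep with a 5-point stencil.

    `u` is a square list of lists whose outermost rows and columns hold the
    constant boundary values; only interior points are updated.
    Assumption: the standard Laplace averaging update
        u_new[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.
    """
    size = len(u)
    new = [row[:] for row in u]  # boundary values are copied unchanged
    for i in range(1, size - 1):
        for j in range(1, size - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


# 3x3 interior (all zeros) surrounded by a boundary held at 1.0.
grid = [[1.0] * 5] + [[1.0, 0.0, 0.0, 0.0, 1.0] for _ in range(3)] + [[1.0] * 5]
after = jacobi_sweep_5pt(grid)
```

Every interior point reads only old values, so all updates within a sweep are independent. A point on a partition edge reads one value owned by the neighboring partition, which corresponds to the $k = 1$ rows of the table.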


Assuming that one iteration cannot begin until the last iteration has ended, it is reasonable to 
model the iteration execution time (or cycle time) by 

$$t_{cycle} = t_{comp} + t_a \qquad (1)$$

where $t_{comp}$ is the computation time of a single partition, and $t_a$ is the data access/transfer and 
synchronization time of a single partition. This model is essentially identical to that in [12] and [16] 
(although we have coalesced communication and synchronization times). $t_a$ depends on the number of 
processors used and the underlying communication architecture. We will develop specific forms for $t_a$ 
as needed. 

The computation time $t_{comp}$ depends on the stencil, the solution algorithm, the time to perform a 
floating point operation, and the number of grid points in a partition (see footnote 1): 

$$t_{comp} = E(S) \cdot A \cdot T_{fp}.$$

Here $E(S)$ is the number of floating point operations per grid point employed by the algorithm 
(assumed to be constant), $A$ is the number of grid points in a partition, and $T_{fp}$ is the time for a 
floating point operation. 

With $A$ grid points per partition the number of processors used is $n^2/A$. For a given architecture we 
will optimize the number of processors by choosing the value of $A$ which minimizes $t_{cycle}$, subject to 
memory constraints and processor availability constraints. Other constraints concern the partition’s 
shape. Square partitions only admit values of $A$ which are perfect squares, thereby substantially 
reducing the number of feasible domain decompositions (and hence freedom in choosing the number of 
processors). Furthermore, it is possible to assign exactly equal work to each processor only if the 
number of processors divides the number of grid points evenly. We will therefore relax the 
requirement that each partition have exactly the same number of points, and when using square partitions 
relax the requirement that partitions be exactly square. 

1 We implicitly assume that the costs of floating point operations strongly dominate the cost of a grid point 
update. Other overhead (such as address calculation and loop indexing) can be added to the model as needed. 

Figure 4 Strip partitioning of domain 

It is easy to decompose the domain into strips for $P$ processors: if $n = kP + r$ with $0 \le r < P$ then 
$r$ processors receive $\lfloor n/P \rfloor + 1$ contiguous rows, and the remaining processors each receive 
$\lfloor n/P \rfloor$ contiguous rows. As illustrated by figure 4, the number of communicating boundaries is the 
same as if all the partitions have equal work. Square partitions raise harder problems. We will 
approximate square partitions with nearly square rectangles which cover the domain in a nice way. The 
rectangles are arranged in a grid fashion as illustrated in figure 5. The domain is first divided into 
strips as before; then into rectangles by defining a border every $m$th column. We require that $m$ divide 
$n$ evenly, and call these legal rectangles. For tractability our analysis treats partition execution and 
communication costs as though the partitions are squares. However, empirical studies described below 
show that the error introduced by this assumption is small. 
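The strip decomposition rule can be sketched in a few lines. The function name `strip_row_counts` is hypothetical, and giving the extra rows to the first $r$ strips is an arbitrary choice made here; the text only fixes how many strips receive an extra row.

```python
def strip_row_counts(n, P):
    """Rows assigned to each of P strips of an n-row grid.

    With n = k*P + r and 0 <= r < P, r strips get k + 1 rows and the
    remaining P - r strips get k rows. Which strips get the extra row
    is an arbitrary choice in this sketch (the first r).
    """
    if not 1 <= P <= n:
        raise ValueError("need 1 <= P <= n")
    k, r = divmod(n, P)
    return [k + 1] * r + [k] * (P - r)


rows = strip_row_counts(10, 4)
```

No two strips differ by more than one row, so the load imbalance is at most $n$ grid points per iteration.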

For a given $n$ it is easy to calculate the area of each legal rectangular partition. For each 
calculated area $A$ we determine the legal rectangle with area $A$ whose perimeter is minimized (several 
different legal rectangles may have the same area). If its perimeter is within 5% of $4\sqrt{A}$ (the perimeter 
of a square with area $A$), we retain the rectangle and discard all other rectangles with area $A$. Otherwise 
we discard all legal rectangles with area $A$, since none are sufficiently square-like. Each remaining 
rectangle is a working rectangle. Not every area $A$ will have a working rectangle with area $A$. Now 
suppose we analytically determine that squares with area $\hat{A}$ optimize performance. We need to find a 
working rectangle which closely approximates a square with area $\hat{A}$. Figure 6a shows the relative 
approximation error in area for a 256x256 grid when we choose the working rectangle with area 
closest to $\hat{A}$; figure 6b shows the relative approximation error in perimeter. $\hat{A}$ ranges from 1024 to 
16384 (every even value of $\hat{A}$ is plotted), reflecting decompositions using 4 to 64 processors. We see 
that the error introduced by this approximation is quite small, usually less than 3% for area and less 
than 6% for perimeter. Similar results were obtained for 128x128, 512x512, and 1024x1024 size grids. 
We can consequently optimize partition area as though partitions are exactly square with the assurance 
that the costs obtained are not far different from costs that are truly achievable. We next consider this 
optimization for various architecture types. 

Figure 5 Rectangular partition of domain 

Figure 6 Bar graphs of approximation errors: (a) relative magnitude error in area, (b) relative magnitude 
error in perimeter 
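The selection of working rectangles can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to produce figure 6: a legal rectangle is taken to have a column count dividing $n$ and any row count from 1 to $n$, its perimeter is taken as twice the sum of its sides, and the names `working_rectangles` and `closest_working_rectangle` are hypothetical.

```python
import math


def working_rectangles(n, tolerance=0.05):
    """Map each area A to its working rectangle (rows, cols), if one exists.

    Assumptions of this sketch: a legal rectangle has `cols` dividing n and
    any row count from 1 to n, and the perimeter of a rows x cols rectangle
    is 2 * (rows + cols), which equals 4 * sqrt(A) for a square.
    """
    widths = [m for m in range(1, n + 1) if n % m == 0]
    best = {}  # area -> (perimeter, rows, cols) with minimal perimeter
    for cols in widths:
        for rows in range(1, n + 1):
            area = rows * cols
            perimeter = 2 * (rows + cols)
            if area not in best or perimeter < best[area][0]:
                best[area] = (perimeter, rows, cols)
    # Keep the minimal-perimeter rectangle only if it is square-like enough.
    return {
        area: (rows, cols)
        for area, (perimeter, rows, cols) in best.items()
        if perimeter <= (1 + tolerance) * 4 * math.sqrt(area)
    }


def closest_working_rectangle(n, target_area):
    """Working rectangle whose area is closest to `target_area`."""
    rects = working_rectangles(n)
    area = min(rects, key=lambda a: abs(a - target_area))
    return area, rects[area]


rects = working_rectangles(16)
```

On a 16x16 grid, for instance, area 32 has no working rectangle under these assumptions: its best legal rectangle is 8x4 with perimeter 24, which exceeds 1.05 times $4\sqrt{32}$.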


4. Hypercube 

Due to its commercial availability and interesting topological properties, a hypercube architecture 
such as the Intel iPSC [11] is a natural candidate for PDE solutions. The hypercube’s rich 
communication topology allows the mapping of adjacent strip (or square) partitions onto processors in such a 
way that logically adjacent partitions are mapped onto physically adjacent processors (at least with 
stencils having no diagonals). This property is very important, because it implies that there is no 
contention for communication resources between non-logically adjacent partitions. The cost of sending a 
packet of data from one partition to another is independent of the total amount of communication on 
the system. We may model the communication delay of a $V$ byte message from one processor to an 
adjacent processor as 

$$t_c = \alpha \left\lceil \frac{V}{packetsize} \right\rceil + \beta$$

where $\alpha$ is the per packet transmission cost, and $\beta$ is a startup cost. We assume that the problem size 
is fixed at $n^2$, and that if $N$ processors are used, each partition gets $n^2/N$ points. Thus, as we allow $N$ to 
increase for a fixed value of $n^2$, the number of points in a partition decreases, so that both the 
execution cost ($t_{comp}$) and the communication/synchronization cost ($t_a$) for the partition decrease. This 
implies that $t_{cycle}$ as defined in equation (1) is a decreasing function of $N$ over the interval $[2, n^2]$. If 
only one processor is used then no communication costs are suffered; if the one processor execution 
cost is still greater than the two processor cost, then using all processors is optimal. If the one 
processor cost is less than the two processor cost, but greater than the cost of using all processors, then using 
all processors is again optimal. The last possibility is that the communication costs are so high that the 
one processor cost is less than the cost of using all processors. In this case, using only one processor 
is optimal. Thus we see that $t_{cycle}$ is minimized by either spreading the computation out over as many 
processors as possible, or by placing the whole domain into one processor. If memory limitations 
prohibit the latter option, then the computation should be spread maximally. 

Assume that the grid is spread across all available processors as squares, and consider the effect 
of increasing both $n^2$ and $N$ in such a way that the number of grid points per processor remains 
constant (say $F$ points per processor) as $n^2$ increases. This implies that the optimal cycle time is the 
constant (see footnote 2) 

$$C = E(S)\,F\,T_{fp} + 8\left(\left\lceil \frac{4\sqrt{F}}{packetsize} \right\rceil \alpha + \beta\right).$$

The optimal speedup is then $E(S)\,n^2\,T_{fp}/C$, which is linear in $n^2$. 

2 This expression assumes that only one communication port can be active at a time in a processor, and that 
the communication link is half duplex. 

If the number of processors is fixed at $N$, the cycle time of a processor is then 

$$t_{cycle} = \frac{E(S)\,n^2\,T_{fp}}{N} + 8\left(\left\lceil \frac{V(n^2)}{packetsize} \right\rceil \alpha + \beta\right)$$

where $V(n^2)$ denotes the volume of a partition’s communication. $V(n^2) = 2n$ for strips and 
$V(n^2) = 4\sqrt{n^2/N}$ for squares; it is easily checked that speedup for both squares and strips approaches $N$ 
as $n^2 \to \infty$. 
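The fixed-$N$ hypercube model is easy to evaluate numerically. The sketch below is illustrative only: the function names and every parameter value are hypothetical, and the communication volume $V(n^2)$ is passed in by the caller because it depends on the partition shape.

```python
import math


def hypercube_cycle_time(n, N, E, T_fp, alpha, beta, packetsize, volume):
    """Cycle time with N >= 2 processors on an n x n grid.

    `volume` is the partition's communication volume V(n^2), supplied by the
    caller; the factor 8 and the ceiling follow the expression in the text.
    """
    t_comp = E * n * n * T_fp / N
    t_a = 8 * (math.ceil(volume / packetsize) * alpha + beta)
    return t_comp + t_a


def hypercube_speedup(n, N, E, T_fp, alpha, beta, packetsize, volume):
    """Speedup relative to one processor, which pays no communication cost."""
    serial = E * n * n * T_fp
    return serial / hypercube_cycle_time(n, N, E, T_fp, alpha, beta,
                                         packetsize, volume)


# Hypothetical parameter values, chosen only for illustration.
params = dict(N=16, E=4, T_fp=1e-6, alpha=1e-3, beta=1e-3, packetsize=128)
small = hypercube_speedup(n=64, volume=2 * 64, **params)      # strips: V = 2n
large = hypercube_speedup(n=1024, volume=2 * 1024, **params)
```

With these made-up values the 64x64 grid gives a speedup below 1, so a single processor is the better of the two extremes, while the 1024x1024 grid moves toward the limit of $N = 16$.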

The quick analysis above fails to consider the very important activity of convergence checking. 
A convergence check requires that every updated grid point value be compared with its last value. 
Depending on the convergence criterion employed, another iteration is called for if the updated solution 
is too "different" from the last estimate. Every partition determines whether its subgrid is converged 
and produces either a convergence flag, or a number (e.g. sum of squared update differences over 
subgrid) which must be disseminated throughout the entire network. For small stencils like 5-point, 
the additional computation required to do a convergence check can be 50% of the grid update compu- 
tation. Furthermore, communication during the dissemination stage is not local, and the delay due to 
this stage increases in the number of processors used. Saltz, Naik, and Nicol examine this problem in 
[13], and note that the communication cost for convergence checking is extremely high due to message 
packaging and handling costs. They then give algorithms for scheduling convergence checks; measure- 
ments taken on an Intel iPSC show that despite the potentially very high cost of convergence check- 
ing, these algorithms reduce that cost to an insignificant amount. For the sizes of hypercubes currently 
available, we may safely ignore convergence checking costs in hypercubes. 

5. Grid Architectures 

Parallel architectures have been designed with nearest neighbor communication, e.g. the Illiac IV 
[5], and NASA’s Finite Element Machine (FEM)[1]. The observations made for hypercubes apply 
equally well: the communication costs increase as the partition size increases, implying that the work 
should be spread as evenly as possible or lumped onto one machine (which makes little sense on the 
aforementioned machines). This type of machine often provides a global bus, and additional hardware 
for functions such as convergence checking. Provided that such additional hardware exists, the 
communication overhead of convergence checking does not appear to be as significant a concern as it is 
with hypercubes (although the additional computational cost may still be significant). 

Adams and Crockett [1] analyze a conjugate gradient code on the FEM. Each iteration of this 
code requires every processor to send every other processor a number, and a processor adds together 
all such numbers. Eventually adding more processors to a fixed size problem causes this communication 
and addition to dominate performance. The result is that increasing the number of processors past a 
certain threshold increases the algorithm execution time. This highlights the fact that the monotonicity 
we claim for hypercube and grid machines depends very much on the exclusively nearest neighbor 
communication pattern. In the next section we will see that in bus architectures the communication cost 
can actually decrease in increasing partition size, making for a more interesting optimization problem. 

6. Bus Architectures 

Shared memory bus architectures are another important class of commercially available parallel 
processors. Currently, several vendors offer a few tens of processors on a common bus; we denote the 
maximum number of processors available by N. We suppose that the architecture supports local 
memory and global memory, with global memory access being several times slower than local memory 
access (several of the commercial machines do not support this model; they do support caches which, 
if sufficiently large, could be viewed as local memory). We will consider both synchronous and 
asynchronous busses: a synchronous bus requires a processor requesting service to wait until that service is 
completed; an asynchronous bus admits overlapping computation and data writes to the global memory. 
We will see that in both cases contention for global memory via the bus can degrade performance to 
the point where adding processors increases execution time. 

Reed et al. [12] also observe that a processor’s management of boundary values makes an impor- 
tant difference in performance. Following their advice, we will assume that each processor copies its 
neighbors’ boundary points into local memory at the start of an iteration, and writes its own boundary 
points out to memory at the iteration’s end. In our experience on the FLEX/32 [8], the cost of 
transferring a word to or from common memory is best modeled (ignoring contention) as c + b, where 
c is a fixed overhead cost due to address calculation and any overhead for accessing the bus, and b is 
the bus cycle time. Because all communication is serialized by a bus, the relative importance of 
different types of communication can be compared by their volume. The cost of communicating con- 
vergence checking information on bus architectures is insignificant because it involves only one 
number from each processor, and is hence ignored here. 


6.1. Synchronous Bus 

We model a synchronous bus and the contention it imposes by assuming that if $P$ processors are 
simultaneously requesting service, the effective delay seen by each processor is $c + bP$ time per 
floating point number (see footnote 3). The transfer time $t_a$ depends on the partition and on $P$. For strips 
with area $A$, each partition has $2n$ boundary points, and $n^2/A$ processors simultaneously require bus 
access. $t_a$ for strips is consequently given by 

$$t_a^{strip} = 4n\,k(strip,S)\left(c + b\,\frac{n^2}{A}\right) = \frac{4 n^3 b\,k(strip,S)}{A} + 4 n c\,k(strip,S).$$

The cycle time is then 

$$t_{cycle}^{strip} = E(S)\,A\,T_{fp} + \frac{4 n^3 b\,k(strip,S)}{A} + 4 n c\,k(strip,S). \qquad (2)$$

Note that the communication costs expressed by equation (2) are decreasing in $A$, making (2) the sum 
of a convex increasing term and a convex decreasing term. Equation (2) is consequently a convex 
function of $A$, so that the $A$ minimizing (2) is easily found using calculus. If the $A$ so determined falls 
outside of bounds placed by memory or processor limitations then either the least or the largest 
admissible value of $A$ optimizes performance. The unconstrained minimizer $\hat{A}$ is given by 

$$\hat{A} = \left[\frac{4 n^3 b\,k(strip,S)}{E(S)\,T_{fp}}\right]^{1/2}. \qquad (3)$$

3 For our problem, this assumption yields the same performance as if every processor were able to retain the 
bus for its entire transmission. This follows since one processor will be last to receive the bus; its effective 
communication time is $c + bP$ per floating point number. This model also implicitly assumes that available 
processors which are not participating in the computation do not significantly interfere with bus service. 


It is important to note that $\hat{A}$ depends on most of the problem and architectural parameters assumed by 
the model (the overhead cost $c$ does not affect $\hat{A}$). When $\hat{A} > n^2/N$ is not a multiple of $n$, we calculate 
$A_l = n\lfloor \hat{A}/n \rfloor$ and $A_h = A_l + n$. Between these two we choose the area yielding a smaller cycle time; the 
convexity of (2) ensures that this time is optimal among strips. Substitution of $\hat{A}$ into (2) gives the 
optimized cycle time when arbitrarily many processors are available: 

$$t_{cycle}^{strip} = 2\left(E(S)\,T_{fp}\,n^3\,b\,k(strip,S)\right)^{1/2} + 2\left(E(S)\,T_{fp}\,n^3\,b\,k(strip,S)\right)^{1/2} + 4 n c\,k(strip,S).$$

Here we see that for sufficiently large $n$ (or sufficiently small $c$) the computation time and the 
communication time are essentially identical. This expression then shows what leverage we have in 
improving performance by improving hardware. For example, suppose that we have optimized 
performance for one set of architectural parameters, and wish to increase processor or bus speed. If we 
double the speed of the bus, the minimized cycle time falls to $1/\sqrt{2}$ of its former value; the same 
improvement is achieved by doubling the speed of a floating point operation. Since the original configuration 
was optimized, these factors bound from above the performance gain we can achieve by doubling 
processor or bus speed on any subsequent partitioning of the domain. On the other hand, if $c$ is large relative 
to expected problem sizes, then the overhead cost $4 n c\,k(strip,S)$ will dominate the communication cost, 
so that any speed increase in the bus will not significantly improve performance; decreasing $c$, however, 
has a linear impact on $t_{cycle}^{strip}$. 
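The strip optimization on a synchronous bus can be checked numerically. The sketch below evaluates equation (2), the minimizer of equation (3), and the rounding of the optimum to a whole number of rows. All parameter values are hypothetical, the function names are illustrative, and the memory and processor-count bounds mentioned in the text are not enforced.

```python
import math


def strip_cycle_time(A, n, E, T_fp, b, c, k):
    """Equation (2): cycle time for strips of area A on a synchronous bus."""
    return E * A * T_fp + 4 * n**3 * b * k / A + 4 * n * c * k


def optimal_strip_area(n, E, T_fp, b, k):
    """Equation (3): the unconstrained minimizer of equation (2)."""
    return math.sqrt(4 * n**3 * b * k / (E * T_fp))


def best_whole_row_area(n, E, T_fp, b, c, k):
    """Round the optimum to a multiple of n, keeping the faster candidate."""
    a_hat = optimal_strip_area(n, E, T_fp, b, k)
    a_low = max(n, n * math.floor(a_hat / n))  # at least one row per strip
    a_high = a_low + n
    return min((a_low, a_high),
               key=lambda A: strip_cycle_time(A, n, E, T_fp, b, c, k))


# Hypothetical values: E = 4 flops per point, T_fp = 1e-6 s, b = 5e-6 s, c = 0.
n, E, T_fp, b, c, k = 256, 4, 1e-6, 5e-6, 0.0, 1
a_hat = optimal_strip_area(n, E, T_fp, b, k)
a_best = best_whole_row_area(n, E, T_fp, b, c, k)
closed_form = 4 * math.sqrt(E * T_fp * n**3 * b * k) + 4 * n * c * k
```

At the unconstrained optimum the computation and bus terms are equal, so the cycle time evaluated at `a_hat` agrees with `closed_form`. For these values the optimum is about 9159 points per strip, which rounds to 36 rows of 256 points.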

Fewer than $N$ processors should be used if $n^2/\hat{A} < N$. By (3), this is equivalent to 

$$\frac{N^2\,b}{T_{fp}} > \frac{E(S)\,n}{4\,k(strip,S)}. \qquad (4)$$

Inequality (4) gives a simple expression relating hardware characteristics to problem characteristics. If 
$\hat{A} < n^2/N$, then the grid should be distributed across all $N$ processors, giving a cycle time of 

$$t_{cycle}^{strip} = \frac{E(S)\,n^2\,T_{fp}}{N} + 4 n b N\,k(strip,S) + 4 n c\,k(strip,S).$$

Using this expression we calculate speedup 

$$Speedup_N^{strip} = \frac{N\,E(S)\,T_{fp}}{E(S)\,T_{fp} + (4 b N^2 + 4 c N)\,k(strip,S)/n} \qquad (5)$$

which is seen to approach $N$ as $n^2 \to \infty$. 

Square partitions are handled similarly. The communication time for a square partition with $s$ 
points per side is (see footnote 4) 

$$t_a^{square} = 8 s\,k(square,S)\left(c + b\,\frac{n^2}{s^2}\right) = 8\,k(square,S)\,b\,\frac{n^2}{s} + 8 s c\,k(square,S),$$

a quantity which is always smaller than the corresponding cost for strips with $s^2$ points. Also note that 
the increasing or decreasing behavior of this cost in $s$ is strongly dependent on the relative values of $b$ 
and $c$. The cycle time using squares with $s$ points per side is 

$$t_{cycle}^{square} = E(S)\,s^2\,T_{fp} + 8\,k(square,S)\,b\,\frac{n^2}{s} + 8 s c\,k(square,S).$$

4 This expression assumes that the number of boundary points a partition writes to global memory is the 
same as the number read in. This is not rigorously true for any stencil which uses diagonals: our expression 
does not count diagonal elements required by the 4 corner points. However, when the number of partition 
points is large relative to the number of processors, this approximation is reasonable. 

The importance of the relationship between $c$ and $b$ on optimal allocation of processors is illustrated by 
considering necessary conditions under which fewer than all processors are optimally used. 
Differentiating with respect to $s$ and setting equal to zero yields the equation 

$$E(S)\,T_{fp}\,s^3 + 4\,k(square,S)\left[c\,s^2 - b\,n^2\right] = 0.$$

Now suppose that $t_{cycle}^{square}$ is minimized by $\hat{s}^2 = n^2/P$, $2 < P < N$. Then $P$ processors are employed, and 
$\hat{s}$ is a root of the equation above. Substituting this $\hat{s}^2$ back into the equation above, the first term is 
positive, so the bracketed term must be negative; hence a necessary condition on $P$ is that $c/b < P$. 
Recalling that bus architectures typically have fewer than 30 processors, we see that this inequality 
tightly constrains values of $b$ and $c$. Measurements taken on the FLEX/32 suggest that $c/b \approx 1000$, 
implying that numerical problems run on that machine should use all processors. Care in allocating 
processors is apparently needed more when $c$ is less than $b$. Consequently, we now consider the extreme 
case of $c = 0$, and the optimal speedups that are achievable under that assumption. Note that any 
speedups so derived serve as upper bounds on speedups gained when $c \ne 0$. 
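The stationarity condition for squares is a cubic in $s$ with a single positive root, so it can be solved by bisection. The sketch below does this for hypothetical parameter values with $c = 2b$ and confirms the necessary condition $c/b < P$ for them; the function names are illustrative.

```python
def square_cycle_time(s, n, E, T_fp, b, c, k):
    """Cycle time for square partitions with s points per side."""
    return E * s * s * T_fp + 8 * k * b * n * n / s + 8 * s * c * k


def optimal_side(n, E, T_fp, b, c, k, tol=1e-9):
    """Positive root of E*T_fp*s^3 + 4k(c*s^2 - b*n^2) = 0 by bisection.

    The left-hand side is increasing for s > 0 and negative at s = 0,
    so the positive root is unique.
    """
    def f(s):
        return E * T_fp * s**3 + 4 * k * (c * s * s - b * n * n)

    lo, hi = 0.0, 1.0
    while f(hi) < 0:          # grow the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical values with c = 2b, so that c/b = 2.
n, E, T_fp, b, c, k = 256, 4, 1e-6, 5e-6, 1e-5, 1
s_hat = optimal_side(n, E, T_fp, b, c, k)
P = n * n / (s_hat * s_hat)   # processors used if s_hat is admissible
```

Here the root lies near 65.8 points per side, giving roughly 15 processors, comfortably above $c/b = 2$.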


If there is no overhead associated with accessing the bus, the optimal square partition size is 
easily shown to be 

$$\hat{s}^2 = \left[\frac{4\,n^2\,b\,k(square,S)}{E(S)\,T_{fp}}\right]^{2/3}.$$

The cycle time using $\hat{s}^2$ points per partition is 

$$t_{cycle}^{square} = (E(S)\,T_{fp})^{1/3}\,(4 n^2 b\,k(square,S))^{2/3} + 2\,(E(S)\,T_{fp})^{1/3}\,(4 n^2 b\,k(square,S))^{2/3},$$

which shows that the communication cost is twice the computation cost. This expression also 
shows that we have more leverage by improving communication speed than we do computation speed: 
doubling the speed of the bus gives a cycle time which is 63% of the original; doubling the speed of 
a floating point computation gives a cycle time which is 79% of the original. As with strips, simple 
algebra shows that fewer than $N$ processors should be used if 

$$\frac{N^{3/2}\,b}{T_{fp}} > \frac{E(S)\,n}{4\,k(square,S)}. \qquad (6)$$

Inequalities (4) and (6) show that a strip decomposition of a given problem will always call for
fewer (or equal) processors than a square decomposition (provided that k(square,S) = k(strip,S)). The
minimal problem size which uses all N processors is found by treating (6) as an equality and solving
for n. Figure 7 plots the log (base 2) of the minimal problem size $n^2$ which gainfully uses all N
processors, as a function of N. For the parameter values considered we see that a 256x256 grid with
square partitions and a 5-point stencil should be solved on 1 to 14 processors; the same grid with a
9-point stencil should use 1 to 22 processors. The higher computation to communication ratio of the
9-point stencil allows more parallelism in computation for the same amount of communication.
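As an illustration of how inequality (6) is used, the sketch below reads it as N^(3/2) b / T_fp > E(S) n / (4 k(square,S)) and solves the corresponding equality for n. The helper names and the parameter values are hypothetical; in particular they are not the parameter values behind Figure 7.

```python
# Inequality (6), read as: fewer than N processors should be used when
#   N**1.5 * b / T_fp > E * n / (4 * k),   with E = E(S) and k = k(square,S).

def prefers_fewer_processors(N, n, E, T_fp, b, k=1.0):
    """True when (6) holds, i.e. an n x n grid is too small to use all N processors."""
    return N**1.5 * b / T_fp > E * n / (4.0 * k)

def minimal_grid_side(N, E, T_fp, b, k=1.0):
    """Grid side n at which (6) becomes an equality for N processors."""
    return 4.0 * k * b * N**1.5 / (E * T_fp)

E, T_fp, b = 5.0, 1.0e-6, 5.0e-6    # hypothetical values, not the Figure 7 parameters
n_min = minimal_grid_side(16, E, T_fp, b)
print(n_min)
print(prefers_fewer_processors(16, n_min / 2, E, T_fp, b))   # smaller grid
print(prefers_fewer_processors(16, n_min * 2, E, T_fp, b))   # larger grid
```

The minimal grid side grows as N^(3/2), so each doubling of the processor count requires roughly 2.8 times as many points per grid side before all processors are gainfully used.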

For sufficiently large $n^2$ all N processors should be employed. The speedup achieved is


Speedup jj uan = 


NE(S)Tfp 


" n 

which also approaches N as $n^2 \to \infty$. Comparison of this speedup with speedup for strips (equation (5)
with c = 0) shows the clear superiority of squares using realistic parameter values and large problems.
Supposing that $E(S)T_{fp} = b$, N = 16, k(strip,S) = k(square,S) = 1, and n = 256, the speedup for strips is

$$\frac{16}{1 + 512/n} \approx 5.3,$$

while the speedup for squares is

$$\frac{16}{1 + 128/n} \approx 10.6.$$

Increasing the grid to 1024x1024 raises the strip speedup to 10.6 and the square speedup to 14.2.

Figure 7 Minimal problem size as function of processors: $\log_2(n^2)$ plotted against N for
(a) Synchronous, Strip; (b) Asynchronous, Strip; (c) Synchronous, Square, at fixed values of the
parameters $T_{fp}$ and b.
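The arithmetic of this example is easy to reproduce. The sketch below simply evaluates the two fixed-N expressions quoted in the text, taken as given for E(S) T_fp = b, N = 16 and k = 1; the function names are illustrative.

```python
# Fixed-N speedups for E(S)*T_fp = b, N = 16, k = 1, using the two expressions
# quoted in the text as given.

def strip_speedup(n):
    return 16.0 / (1.0 + 512.0 / n)

def square_speedup(n):
    return 16.0 / (1.0 + 128.0 / n)

for n in (256, 1024):
    print(n, round(strip_speedup(n), 2), round(square_speedup(n), 2))
```

Both expressions approach 16 as n grows, but the square decomposition gets there much sooner because its overhead term is a quarter of the strip term.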


It is interesting (and straightforward) to calculate the optimal speedup when processors are not
limited to N. For strips we obtain

$$Speedup_{optimal}^{strip} = \frac{n^{1/2}}{4} \left[ \frac{E(S)\, T_{fp}}{b\, k(strip,S)} \right]^{1/2}.$$


This speedup is proportional to $(n^2)^{1/4}$, a rather disheartening figure. With squares we fare only
somewhat better. Optimal speedup is

$$Speedup_{optimal}^{square} = \frac{n^{2/3}}{3} \left[ \frac{E(S)\, T_{fp}}{4 b\, k(square,S)} \right]^{2/3},$$

a figure proportional to $(n^2)^{1/3}$. Figure 8 gives speedup curves and processor counts as a function of
$\log(n^2)$ for the same problem parameters as addressed by Figure 7. These unremarkable speedups
support the common wisdom that bus architectures do not scale up. This does not negate the utility of
these machines: the speedups we calculated for a 16 processor machine on large grids were acceptable.
However, significantly larger speedups for this same problem are possible using a (larger) hypercube.
If minimizing the computation’s execution time remains the prime objective then other architectures
should be considered.

Figure 8 Speedup and processors required to achieve speedup
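The growth rates of the unbounded-processor speedups can be illustrated with a short calculation. This is a sketch under assumed c = 0 cycle-time models (strips: E(S) T_fp A + 4 n^3 k b / A; squares: E(S) T_fp s^2 + 8 n^2 k b / s), with hypothetical parameter values; it ignores the requirement that partition sizes be integers.

```python
import math

def strip_optimal_speedup(n, ET, b, k=1.0):
    """Serial time over the minimum of ET*A + 4*n**3*k*b/A (assumed c = 0 strip model)."""
    A = math.sqrt(4.0 * n**3 * k * b / ET)        # minimizing strip area
    return ET * n**2 / (ET * A + 4.0 * n**3 * k * b / A)

def square_optimal_speedup(n, ET, b, k=1.0):
    """Serial time over the minimum of ET*s**2 + 8*n**2*k*b/s (assumed c = 0 square model)."""
    s = (4.0 * n**2 * k * b / ET) ** (1.0 / 3.0)  # minimizing partition side
    return ET * n**2 / (ET * s**2 + 8.0 * n**2 * k * b / s)

ET, b = 1.0e-6, 1.0e-6                            # hypothetical values; ET = E(S)*T_fp
for n in (256, 4096):
    print(n, strip_optimal_speedup(n, ET, b), square_optimal_speedup(n, ET, b))

# Growing n by 16x multiplies the strip speedup by 16**0.5 and the square speedup by 16**(2/3).
strip_gain = strip_optimal_speedup(4096, ET, b) / strip_optimal_speedup(256, ET, b)
square_gain = square_optimal_speedup(4096, ET, b) / square_optimal_speedup(256, ET, b)
print(strip_gain, square_gain)
```

A 256-fold increase in the number of grid points thus buys only a 4-fold (strips) or roughly 6.3-fold (squares) increase in optimal speedup under these assumptions.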

6.2. Asynchronous Bus

Better performance can be expected if we are able to overlap communication and computation. 
We next consider an architecture which allows asynchronous writes to global memory, but requires 
processors to wait for completion of their read requests. We then view an iteration as a reading phase, 
followed by a computation phase. During the computation phase, we assume that a boundary value is 
written to global memory as soon as it is updated. To maximize performance, we also assume that 
boundary values are updated before any other points. 

The time required to read the boundary points is exactly half of $t_a$ derived in the previous section.
During the computation phase, a boundary point is updated every $E(S)T_{fp}$ units of time until all
boundary points have been updated. The time required to update all A points in a partition is
$E(S) A T_{fp}$. If at this time the bus has managed to complete all requested writes, then the iteration
is finished. Otherwise, the iteration does not terminate until the bus services its backlog of boundary
value writes. If a backlog exists after all points are updated and P processors are in use, then clearly
the bus is unable to service P boundary value writes in time $E(S)T_{fp}$. Consequently, if a backlog
exists, the bus has been fully utilized during the entire computation phase. We may therefore write

$$t_{cycle} = t_{read} + \max\{ E(S) A T_{fp},\; b\, B_{total} \} \qquad (7)$$

where $t_{read} = t_a/2$ and $B_{total}$ is the total load (summed over all processors) offered to the bus
during the iteration.


For strips with area A, the cycle time is

$$t_{cycle}^{strip} = \frac{2 n^3 b\, k(strip,S)}{A} + \max\left\{ E(S) A T_{fp},\; \frac{2 n^3 b\, k(strip,S)}{A} \right\}.$$

Again, this function is convex in A, with its minimum precisely where the arguments to the max
function are equal:

$$A = n^{3/2} \left[ \frac{2 b\, k(strip,S)}{E(S)\, T_{fp}} \right]^{1/2}. \qquad (8)$$

The corresponding area given by equation (3) for a synchronous bus is exactly a factor of $\sqrt{2}$ larger.

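As before, it is easy to show that fewer than N processors should be used if

$$\frac{N^2\, b}{T_{fp}} > \frac{E(S)\, n}{2\, k(strip,S)}.$$

The optimal speedup is given by

$$Speedup_{optimal}^{strip} = \frac{n^{1/2}}{2\sqrt{2}} \left[ \frac{E(S)\, T_{fp}}{b\, k(strip,S)} \right]^{1/2}.$$

Comparison with the synchronous bus speedup shows that the asynchronous bus speedup is a factor of
$\sqrt{2}$ better.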

The cycle time for a square partition with $s^2$ points is

$$t_{cycle}^{square} = \frac{4 n^2 b\, k(square,S)}{s} + \max\left\{ E(S) s^2 T_{fp},\; \frac{4 n^2 b\, k(square,S)}{s} \right\}.$$

This is a convex function of s which is minimized when the arguments of the max function are equal:

$$\hat{s}^2 = \left[ \frac{4 b n^2\, k(square,S)}{E(S)\, T_{fp}} \right]^{2/3}.$$

This area is identical to that calculated for the synchronous bus case. The asynchronous bus optimal
speedup is

$$Speedup_{optimal}^{square} = \frac{n^{2/3}}{2} \left[ \frac{E(S)\, T_{fp}}{4 b\, k(square,S)} \right]^{2/3},$$

which is 150% of the synchronous bus speedup (that is, 50% larger).
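The relationship between the synchronous and asynchronous optima can be checked by brute force rather than by calculus. The sketch below assumes c = 0 and assumes a per-phase bus time of 2 n^3 b k / A for strips and 4 n^2 b k / s for squares; the asynchronous cycle follows equation (7). All names and parameter values are illustrative assumptions.

```python
# Numerical comparison of synchronous and asynchronous bus optima with c = 0.
# Assumed bus time per phase (read or write): strips 2*n**3*b*k/A, squares 4*n**2*b*k/s.

def sync_cycle(compute, phase):
    """Read phase, computation, then write phase, with no overlap."""
    return compute + 2.0 * phase

def async_cycle(compute, phase):
    """Equation (7): blocking reads, writes overlapped with computation."""
    return phase + max(compute, phase)

def geometric_grid(lo, hi, ratio=1.001):
    """Candidate partition sizes from lo to hi in 0.1% steps."""
    x = lo
    while x <= hi:
        yield x
        x *= ratio

def best_cycle(cycle, compute_of, phase_of, lo, hi):
    """Smallest cycle time over the candidate sizes (brute-force search)."""
    return min(cycle(compute_of(x), phase_of(x)) for x in geometric_grid(lo, hi))

n, ET, b, k = 1024, 1.0e-6, 1.0e-6, 1.0       # hypothetical values; ET = E(S)*T_fp

def strip_compute(A):
    return ET * A

def strip_phase(A):
    return 2.0 * n**3 * b * k / A

def square_compute(s):
    return ET * s * s

def square_phase(s):
    return 4.0 * n**2 * b * k / s

# Ratio of optimal cycle times = ratio of optimal speedups (asynchronous over synchronous).
strip_gain = (best_cycle(sync_cycle, strip_compute, strip_phase, n, n * n)
              / best_cycle(async_cycle, strip_compute, strip_phase, n, n * n))
square_gain = (best_cycle(sync_cycle, square_compute, square_phase, 1.0, n)
               / best_cycle(async_cycle, square_compute, square_phase, 1.0, n))
print(strip_gain, square_gain)
```

Up to the resolution of the search grid, the two ratios come out near the square root of 2 for strips and near 3/2 for squares, independent of the parameter values chosen.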


The most interesting thing to note about our asynchronous bus results is their relationship to the
synchronous bus results. For both strip and square partitions we observe that optimal asynchronous bus
performance is a constant (albeit substantial) factor better than synchronous bus performance. Constant
factor improvement remains even if we relax the requirement that global memory reads are synchronous
(in this case we assume that half the grid points are updated in parallel with the initial read
requests, the other half in parallel with the boundary writes; this improves the speedup by a further
factor of 1.26). The inevitable contention for communication resources, even when conducted in
parallel with computation and even when fixed overhead is ignored, constrains the optimal speedup to
be $O((n^2)^{1/4})$ for strips and $O((n^2)^{1/3})$ for squares.
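The figure quoted for fully asynchronous reads corresponds to a ratio of about 1.26, which can be reproduced for square partitions under one reading of the parenthetical assumption. The sketch below assumes that the cycle then consists of two halves, each taking the larger of half the computation and one bus phase; this interpretation, the function names, and the parameter values are assumptions made for illustration.

```python
# One reading of the fully asynchronous case for square partitions (an assumption):
# half of the updates overlap the reads, half overlap the writes, so
#   cycle(s) = 2 * max(ET*s**2 / 2, 4*n**2*b*k / s).

def write_async_optimum(n, ET, b, k=1.0):
    """Equation (7) optimum for squares: phase == compute, cycle = 2 * compute."""
    s = (4.0 * b * n**2 * k / ET) ** (1.0 / 3.0)
    return 2.0 * ET * s**2

def fully_async_optimum(n, ET, b, k=1.0):
    """Both max arguments equal: ET*s**2/2 == 4*n**2*b*k/s, cycle = ET*s**2."""
    s = (8.0 * b * n**2 * k / ET) ** (1.0 / 3.0)
    return ET * s**2

n, ET, b = 1024, 1.0e-6, 1.0e-6      # hypothetical values; ET = E(S)*T_fp
gain = write_async_optimum(n, ET, b) / fully_async_optimum(n, ET, b)
print(round(gain, 2))
```

Under this reading the gain equals the cube root of 2 for any parameter values, so the speedup still grows only as the cube root of the number of grid points.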

7. Switching Networks 

An important class of parallel machines are those which communicate over a banyan type switching
network (e.g. IBM RP3 [10], BBN Butterfly [2]). For a fixed size network, an exact analysis of the
communication delay suffered by a partition as a function of the number of processors used is messy.
To simplify things we make the following assumptions:

(1) The number of global memory modules is equal to the number of processors; 

(2) Each processor has local memory, and only boundary values are stored in global memory; 

(3) The network switches are 2 by 2; 

(4) The network is sufficiently fast so that we can ignore contention while boundary values are asyn- 
chronously written to global memory. 

Item (1) does not make any assumptions about the location of the global memory modules. They may 
be resident in processors (as with the BBN Butterfly) or not. Assumption (2) is used because the study 
in [12] shows that performance can be much better if local memory is employed. Assumption (3) 
allows us to avoid switch contention under certain circumstances. Assumption (4) is reasonable, since 
we may also schedule the times at which processors write to memory to further avoid contention. It is 
convenient to assume that all of the boundary values a partition reads are stored in the same global 
memory module, different from any other partition’s. When a processor writes its boundary values, it
writes them to the different modules of processors which use those values. Then it is possible to 
assign these modules to partitions in such a way that no contention at switches is ever incurred by any 
boundary value read (presuming all partitions read concurrently). Under these assumptions the global 
memory access time for a read is 


$$t_a = 2 w \log_2(N)$$

where w is the time to traverse a switch, and the factor of two reflects two trips across the network.
An iteration consists of a phase of reading boundary values, followed by a computing phase. During the
computing phase the boundary points are written asynchronously back to global memory. The cycle time
for strip partitions with A points is given by

$$t_{cycle}^{strip} = 4 n\, k(strip,S)\, w \log_2(N) + E(S) A T_{fp}.$$

As a function of A, the cycle time is minimized when A is minimized, meaning that all available
processors are employed. Similarly, the cycle time for square partitions with $s^2$ points is

$$t_{cycle}^{square} = 8 s\, k(square,S)\, w \log_2(N) + E(S) s^2 T_{fp}.$$

This latter time is increasing in s, and so is minimized when s is minimized. As with the hypercube, we
see that problems mapped onto interconnection networks ought to be lumped onto one processor, or
distributed as completely as possible across all processors.
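The all-or-one conclusion can be illustrated by comparing the two extreme allocations directly. This is a minimal sketch assuming the square-partition cycle time 8 s k(square,S) w log2(N) + E(S) s^2 T_fp, with no network traffic when a single processor is used; the function names and parameter values are hypothetical.

```python
import math

def square_cycle(s, N, ET, w, k=1.0):
    """Cycle time for s*s points per partition on an N-processor network (ET = E(S)*T_fp)."""
    return 8.0 * s * k * w * math.log2(N) + ET * s * s

def best_allocation(n, N, ET, w, k=1.0):
    """Compare one processor (no network traffic) against all N processors."""
    serial = ET * n * n
    spread = square_cycle(n / math.sqrt(N), N, ET, w, k)
    return ("all", spread) if spread < serial else ("one", serial)

# Hypothetical values: a fast switch favors spreading, a slow switch favors lumping.
print(best_allocation(256, 16, 1.0e-6, 2.0e-6))
print(best_allocation(16, 16, 1.0e-6, 1.0e-4))
```

Intermediate allocations need not be examined here, because the cycle time is increasing in s whenever more than one processor is used.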


We now allow the size of the parallel system to increase with increasing problem size. For square
partitions we fix F points per processor, making the cycle time

$$t_{cycle}^{square} = 8 \sqrt{F}\, k(square,S)\, w \log_2\!\left( \frac{n^2}{F} \right) + E(S) F T_{fp},$$

giving $O\!\left( \frac{n^2}{\log(n)} \right)$ speedup, which is nearly linear in the problem size. Strip
partitions force an increasing number of points per processor, and have
$O\!\left( \frac{n}{\log(n)} \right)$ optimal speedup.
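The two growth rates can be seen numerically. The sketch below assumes F points per processor on n^2/F processors for squares, and, as a further assumption, one grid row per processor on n processors for strips; names and parameter values are illustrative only.

```python
import math

def scaled_square_speedup(n, F, ET, w, k=1.0):
    """F points per processor on n*n/F processors (ET = E(S)*T_fp)."""
    cycle = 8.0 * math.sqrt(F) * k * w * math.log2(n * n / F) + ET * F
    return ET * n * n / cycle

def scaled_strip_speedup(n, ET, w, k=1.0):
    """One grid row (n points) per processor on n processors."""
    cycle = 4.0 * n * k * w * math.log2(n) + ET * n
    return ET * n * n / cycle

ET, w = 1.0e-6, 1.0e-6               # hypothetical values
for n in (32, 1024):
    sq = scaled_square_speedup(n, 1, ET, w)
    st = scaled_strip_speedup(n, ET, w)
    # Normalized columns stay within a constant factor as n grows.
    print(n, sq, sq * math.log2(n) / n**2, st, st * math.log2(n) / n)
```

The normalized columns change very little between the two grid sizes, which is the behavior expected of speedups growing as n^2/log(n) and n/log(n) respectively.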
These switching network speedups differ from the hypercube speedups only by a factor of
1/log(n), a factor which arises from the growing number of stages of the switching network as the
problem grows. For the size of problems treatable in the near future, this log factor will not be as
significant in determining performance as is switching network speed (for banyan networks), and
message packaging costs (for hypercubes and grids).

8. Conclusions 

A number of factors influence the performance of an elliptic PDE solution on a parallel architec- 
ture. Reed et al. [12] detail the interactions of stencil, partition, and architecture; we use their frame- 
work to look at issues in processor allocation, and maximum possible speedup. For various types of 
architectures we developed equations describing execution time; invariably these functions turned out 
to be convex in the number of grid points assigned to a processor. This convexity shows that the best
assignment of grid points to processors either (1) uses as few processors as possible, (2) uses as many
processors as possible, or (3) is a unique preferred assignment which does not use all available
processors, and is easily determined using calculus. We show that for any collection of model
parameter values, optimal performance on hypercubes, grid-like, and switching network types of
architectures is achieved either by spreading the problem grid across all processors, or by forcing the
grid into as few processors as possible. This result depends heavily on the fact that communication for
the algorithm studied is strictly nearest neighbor; existing studies [1] provide counter-examples for
other communication patterns. For our problem, both synchronous and asynchronous bus architectures
allow for optimal assignments which do not use all processors. However, we showed that in order for this
situation to arise, the fixed overhead cost of communicating a word on the bus must be nearly as small as
the bus cycle time. Our formulas predict the smallest grid size which needs all available processors to
perform optimally; they also give upper bounds on the optimal speedup possible. We noted that bus
architectures can achieve acceptable speedup on reasonably sized grids, despite the potential for
relatively high contention for global memory. Also, by looking at optimized execution times on bus
architectures, we identify the leverage on performance given by increasing processor or network
communication speed.

We also examined the suitability of these architectures for solving increasingly large problems. It
is seen that for any of the aforementioned architectures with a fixed number N of processors, the speedup
approaches N as the grid size increases. More interesting is the behavior of optimal speedup when we
let the architecture grow with the problem size. There we find that square partitions are strongly
preferred over strip partitions; that hypercube speedups grow linearly in $n^2$, switching network speedups
grow proportionally to $n^2/\log(n)$, and that bus architecture speedups grow only as $(n^2)^{1/3}$, even if bus
access is completely asynchronous. Table I summarizes the optimal speedup in $n^2$ as a function of
architecture (square partitions are assumed, one point per processor when appropriate).

Most of our results come as no surprise; they merely substantiate what is commonly thought
about each of these architectures. The implications of these results are simply that communication
volume and contention should be avoided as much as possible. Consider that when processors are no
constraint, strip partitions have a communication volume which is a square root of the computation
volume. At best, we can expect speedup to grow in the square root of the computation volume.
Allow contention proportional to total communication volume (summed over all partitions), and the
optimal speedup drops to the fourth root of $n^2$. Even for squares, allowance of such contention restricts
speedup to a cube root of $n^2$. The clear implication is that contention eventually causes serious
performance degradation; our analysis shows how bad that degradation can be. It is also interesting to
note the rather limited leverage we have on improving bus architecture performance by increasing
processor or communication network speed: reducing the floating point time by a factor of 1/k decreases
optimal execution time only by a factor of $(1/k)^{1/3}$; a similar reduction in bus time reduces optimal
execution time by a factor of $(1/k)^{2/3}$. On the other hand, for strip partitions, reducing the fixed
overhead cost of communication decreases optimal execution time linearly.

Table I Summary of Optimal Speedups

One possible means for reducing contention is to use clever scheduling to access communication
resources. We have not yet explored this possibility, but suggest that it is important to do so given the
significance of the degradation our analysis predicts. Future effort will be devoted to verifying our
analysis empirically, and to investigating the aforementioned scheduling issues.